Natural Language and Image-Based Search Support for Recordings #603
WANGXIAOMIN-HIK wants to merge 11 commits into development from
Conversation
To enhance ONVIF's search capabilities, the following operations have been added to support natural language and image-based search for video recordings:

FindImagebyNL
Purpose: Starts a search session using natural language descriptions to locate relevant video recordings. Example query: "Person wearing a red hat."
Parameters:
- StartPoint: Start time for the search.
- EndPoint: End time for the search.
- RecordingToken: (Optional) Token for the recording to search.
- Text: Natural language description for the search.
- Likelihood: (Optional) Similarity threshold for the search (0~1).
- MaxMatches: (Optional) Maximum number of matches to return.
- KeepAliveTime: Time the search session will be kept alive.
Response:
- SearchToken: A unique reference to the search session.

GetNLSearchResults
Purpose: Retrieves results from a natural language search session initiated by FindImagebyNL.
Parameters:
- SearchToken: Token identifying the search session.
- MinResults: (Optional) Minimum number of results to return.
- MaxResults: (Optional) Maximum number of results to return.
- WaitTime: (Optional) Maximum time to wait for results.
Response:
- ResultList: List of matching results, including metadata such as TargetImageURI, Time, Likelihood, and RecordingToken.

FindImagebyImage
Purpose: Starts a search session using a target image to locate relevant video recordings.
Parameters:
- StartPoint: Start time for the search.
- EndPoint: End time for the search.
- RecordingToken: (Optional) Token for the recording to search.
- TargetImageURI: URI of the target image to be searched.
- MaxMatches: (Optional) Maximum number of matches to return.
- KeepAliveTime: Time the search session will be kept alive.
Response:
- SearchToken: A unique reference to the search session.

GetImageSearchResults
Purpose: Retrieves results from an image-based search session initiated by FindImagebyImage.
Parameters:
- SearchToken: Token identifying the search session.
- MinResults: (Optional) Minimum number of results to return.
- MaxResults: (Optional) Maximum number of results to return.
- WaitTime: (Optional) Maximum time to wait for results.
Response:
- ResultList: List of matching results, including metadata such as TargetImageURI, Time, Likelihood, and RecordingToken.

Schema Updates
onvif.xsd: Added complex types FindImageResult and FindImageResultList to support result structures for both natural language and image-based searches, including fields such as TargetImageURI, Time, Likelihood, and RecordingToken.
search.wsdl: Defined operations FindImagebyNL, GetNLSearchResults, FindImagebyImage, and GetImageSearchResults, and added request and response elements for each operation.

Documentation Updates
RecordingSearch.xml: Added detailed descriptions for the FindImagebyNL and GetNLSearchResults operations, explaining their purpose, parameters, and responses.
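For illustration, the session-based flow described above might look like the following sketch. This is not copied from the PR's WSDL; the `tse` namespace prefix and all element values are assumptions for this example.

```xml
<!-- Hypothetical FindImagebyNL request body; the "tse" prefix and all
     values are example assumptions, not taken from the PR's actual WSDL. -->
<tse:FindImagebyNL>
  <tse:StartPoint>2024-05-01T00:00:00Z</tse:StartPoint>
  <tse:EndPoint>2024-05-01T23:59:59Z</tse:EndPoint>
  <tse:RecordingToken>Recording_1</tse:RecordingToken>
  <tse:Text>Person wearing a red hat</tse:Text>
  <tse:Likelihood>0.8</tse:Likelihood>
  <tse:MaxMatches>50</tse:MaxMatches>
  <tse:KeepAliveTime>PT60S</tse:KeepAliveTime>
</tse:FindImagebyNL>

<!-- The response returns only a session token; results are then polled
     with GetNLSearchResults using this token. -->
<tse:FindImagebyNLResponse>
  <tse:SearchToken>SearchToken_1</tse:SearchToken>
</tse:FindImagebyNLResponse>
```

The client would then call GetNLSearchResults with this SearchToken, optionally bounded by MinResults, MaxResults, and WaitTime, until the session expires after KeepAliveTime.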
…search Updated document and WSDL definitions to allow multiple recording tokens to be passed in a search operation to query multiple recordings at the same time.
wsdl/ver10/search.wsdl
Outdated
<xs:element name="EndPoint" type="xs:dateTime">
  <xs:annotation><xs:documentation>End time for the search.</xs:documentation></xs:annotation>
</xs:element>
<xs:element name="RecordingToken" type="tt:RecordingReference" minOccurs="0" maxOccurs="unbounded">
Add maxOccurs="unbounded", which will allow more than one recording container to be searched.
Yes, thank you for your feedback. In the last meeting, one of the reviewers raised the desire to support searching multiple recording containers.
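With maxOccurs="unbounded", a single request could then carry one element per recording container, e.g. (the `tse` prefix and token values are made up for illustration):

```xml
<!-- Hypothetical request fragment: searching three recordings at once -->
<tse:RecordingToken>Recording_1</tse:RecordingToken>
<tse:RecordingToken>Recording_2</tse:RecordingToken>
<tse:RecordingToken>Recording_3</tse:RecordingToken>
```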
wsdl/ver10/search.wsdl
Outdated
<xs:annotation><xs:documentation>This element contains a list of recording tokens to search.</xs:documentation></xs:annotation>
</xs:element>
<xs:element name="TargetImageURI" type="xs:anyURI">
  <xs:annotation><xs:documentation>The target image to be searched in LocalStorage URI format.</xs:documentation></xs:annotation>
How does the client get this local storage URI to search with?
Yes, thank you for your feedback. The TargetImageURI is the result returned from SearchImageByNL.
Can't we use a SearchImagebyImage request without getting a result from SearchImageByNL? I feel SearchImagebyImage and SearchImageByNL are independent search sessions, i.e. one is an image-based search and the other is a text-based search.
@WANGXIAOMIN-HIK Yes, I agree with @venki5685. I feel there should not be a dependence on either API!
Thank you for your feedback. @venki5685 @kieran242
After thinking about it carefully, we agree: SearchImagebyImage and SearchImageByNL are independent search sessions.
TargetImageURI can be a local URI or a remote URI.
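The two cases could then be sketched as follows (both URIs are hypothetical examples; the `tse` prefix is assumed):

```xml
<!-- Case 1: local URI, e.g. one previously returned by SearchImageByNL
     in LocalStorage format (hypothetical value) -->
<tse:TargetImageURI>http://192.168.1.10/LocalStorage/images/target_0001.jpg</tse:TargetImageURI>

<!-- Case 2: remote URI pointing at an image the client provides
     (hypothetical value) -->
<tse:TargetImageURI>http://client.example.com/uploads/red-hat-person.jpg</tse:TargetImageURI>
```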
wsdl/ver10/search.wsdl
Outdated
</xs:element>

<!-- Define FindImagebyImage -->
<xs:element name="FindImagebyImageRequest">
The FindImagebyImage name can be revisited.
Yes, thank you for your feedback. We changed FindImageByImage to SearchImageByImage, and FindImageByNL to SearchImageByNL.
Is this functionality aimed at a Camera device or Network Video Recorder or Both?
Thank you for your question. @kieran242 Yes, both: this functionality is aimed at camera devices and Network Video Recorders. The device implements an algorithm that leverages massive annotated image-text pairs during training, where visual features (e.g., "dog", "snow", "yellow", "fur") and textual elements (e.g., tokenized "snow/field/dog") are extracted through cross-modal neural networks, forming the foundation of its text-to-image retrieval model. In deployment, the system processes video streams to detect targets, then employs on-device models to generate and store binary-encoded feature vectors for rapid matching.

@WANGXIAOMIN-HIK Many thanks for your response. It was very informative.
doc/RecordingSearch.xml
Outdated
</section>

<section>
  <title>SerachImagebyImage</title>
SerachImagebyImage -> SearchImagebyImage
Thank you for your clear feedback. I have now fixed the issue. I appreciate your help.
kieran242
left a comment
@WANGXIAOMIN-HIK A few minor updates as suggestions: the WSDL update is good, but there is a spelling mistake in the doc.
doc/RecordingSearch.xml
Outdated
<section>
  <title>SerachImagebyImage</title>
  <para>SerachImagebyImage starts a search session, looking for video records based on a provided image. Results from the search are acquired using the GetImageSearchResults request, specifying the search token returned from this request.</para>
- <para>SerachImagebyImage starts a search session, looking for video records based on a provided image. Results from the search are acquired using the GetImageSearchResults request, specifying the search token returned from this request.</para>
+ <para>SearchImagebyImage starts a search session, looking for video records based on a provided image. Results from the search are acquired using the GetImageSearchResults request, specifying the search token returned from this request.</para>
doc/RecordingSearch.xml
Outdated
</section>
<section>
  <title>GetImageSearchResults</title>
  <para>GetImageSearchResults acquires the results from an image-based search session previously initiated by a SerachImagebyImage operation. The response shall not include results already returned in previous requests for the same session.</para>
- <para>GetImageSearchResults acquires the results from an image-based search session previously initiated by a SerachImagebyImage operation. The response shall not include results already returned in previous requests for the same session.</para>
+ <para>GetImageSearchResults acquires the results from an image-based search session previously initiated by a SearchImagebyImage operation. The response shall not include results already returned in previous requests for the same session.</para>
Thank you very much for your advice. I have fixed the spelling error. I appreciate your help.
doc/RecordingSearch.xml
Outdated
<para role="text">The point of time where the search will stop.</para>
<para role="param">RecordingToken - optional [tt:RecordingReference]</para>
<para role="text">Token for the recording to search.</para>
<para role="param">TargetImageURI [xs:anyURI]</para>
Add a search parameter to accept an external image from the client, in addition to the NPL target image URI. The client can then use either NPLTargetImageURI or an external image for the image search feature.
We updated the search functionality and added the TargetImageData parameter.
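A request carrying the new TargetImageData parameter might then look like the sketch below. The element name follows the comment above, but the base64 encoding, surrounding structure, and all values are assumptions, not the PR's final schema:

```xml
<!-- Hypothetical SearchImageByImage request using inline image data
     instead of TargetImageURI; structure, types, and values are assumed. -->
<tse:SearchImageByImageRequest>
  <tse:StartPoint>2024-05-01T00:00:00Z</tse:StartPoint>
  <tse:EndPoint>2024-05-01T23:59:59Z</tse:EndPoint>
  <!-- base64-encoded image bytes supplied directly by the client -->
  <tse:TargetImageData>/9j/4AAQSkZJRgABAQ...</tse:TargetImageData>
  <tse:KeepAliveTime>PT60S</tse:KeepAliveTime>
</tse:SearchImageByImageRequest>
```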
…nd provide detailed explanation of the use of TargetImageURI
doc/RecordingSearch.xml
Outdated
<para role="param">TargetImageURI [xs:anyURI]</para>
<para role="text">The TargetImageURI is the result returned from SearchImageByNL.</para>
<para role="param">TargetImageURI - optional [xs:anyURI]</para>
<para role="text">The URI of the detected target object image. This can be either: - a local image stored in the NPL Target Image repository (LocalStorage format), or - an external image provided by the client for image search or feature matching.</para>
@WANGXIAOMIN-HIK Please add an entry for "NPL" to this document's "Definitions" in section 3.1 to explain that it is "Natural Language Processing". It will add clarity to the document.
… the terminology, as the original meaning refers to the images stored internally on the device.
…ult to FindObjectImageResult.
The description: It represents the cosine similarity between two vectors, which is used to measure the similarity of the directions of the vectors. The closer the value is to 1, the higher the similarity; the closer the value is to 0, the lower the similarity.
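In symbols, for two feature vectors $a, b \in \mathbb{R}^n$ the measure described above is:

```latex
\mathrm{similarity}(a, b) = \cos\theta
  = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
  = \frac{\sum_{i=1}^{n} a_i b_i}
         {\sqrt{\sum_{i=1}^{n} a_i^2}\,\sqrt{\sum_{i=1}^{n} b_i^2}}
```

In general this lies in [-1, 1]; it stays in [0, 1] as described above when the feature vectors are non-negative.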
doc/RecordingSearch.xml
Outdated
<para role="text">Token for the recording to search.</para>
<para role="param">Text [xs:string]</para>
<para role="text">Natural language description for the search.</para>
<para role="param">CosineSimilarity - optional [xs:float]</para>
CosineSimilarity is just one out of a set of common similarity measures.
I can imagine that ONVIF just defines an abstract similarity between zero and one, or a complex item supporting multiple similarity measures.
For the sake of simplicity I prefer the first approach, as the second one would require a set of capabilities indicating which ones a device supports.
Yes,
cosine similarity is just one of the most commonly used similarity measures in vector space. In actual image retrieval/similarity assessment, multiple distance measures, matching based on local features, perceptual similarity, as well as learned metrics or hash/quantization indexing are also used.
Could we consider changing the field back to 'Similarity', while noting in the comments that we are currently using the cosine vector method? @dstafx
OK for me, since this is a big and important topic that we likely need to come back to on a more general level. But to explain a bit more:
My general concern (and the industry challenge) is that if we are too generic it will not be useful across vendors, as the implementation scores would not be comparable. If it is a generic number, the message to the client is that every endpoint where this interface is offered may have a different implementation, so the similarity is not comparable between endpoints. When searching, some devices or vendors may then consistently report higher similarities, thereby potentially hiding relevant results. By stating that the similarity is only for sorting search results from a single endpoint, we can avoid this.
Thank you very much for your detailed explanation. After careful consideration, I still think it should be defined as similarity, as we cannot restrict the vendors' implementation algorithms.
However, regarding the issue you mentioned about cross-device and cross-vendor search, we can add a note stating that this similarity is only guaranteed to be effective for ranking within the same search result returned by the same endpoint, and it does not guarantee that similarity can be compared across different vendors, devices, or endpoints.
As for the cross-device search issue, we can discuss it in our next meeting. This would require imposing constraints on the hardware vendors' implementation mechanisms, such as ensuring that devices support the same algorithm scheduling to guarantee consistency in device detection mechanisms.
@WANGXIAOMIN-HIK @dstafx @HansBusch Is this issue resolved regarding the "similarity measures"? I see it is required for IPR review.
kieran242
left a comment
@WANGXIAOMIN-HIK approved with discussion on VE WG Call.
It appears that both APIs, GetImageSearchResults and GetNLSearchResults, currently return identical results. We could either consolidate them into a single unified method for search results,
GetNLSearchResults and GetImageSearchResults are distinguished by their usage scenarios.
@WANGXIAOMIN-HIK, if you believe it makes sense to keep them separate, I'd recommend using different result formats. This would allow each interface to evolve independently. (For example, if we later add parameters specific to image-based search, they wouldn't automatically apply to natural language search results.)
… of the interface. We have added FindNLSearchResultList and FindNLSearchResult to distinguish the return values of the GetNLSearchResults and GetImageSearchResults interfaces.
GetNLSearchResults -> FindNLSearchResultList, FindNLSearchResult
GetImageSearchResults -> FindObjectImageResultList, FindObjectImageResult
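Based only on the field names mentioned in this PR, the image-search result types might be sketched roughly as below; the exact element types, ordering, and optionality are assumptions, not the merged schema:

```xml
<!-- Rough sketch of the result types; types and ordering are assumed -->
<xs:complexType name="FindObjectImageResult">
  <xs:sequence>
    <xs:element name="RecordingToken" type="tt:RecordingReference"/>
    <xs:element name="Time" type="xs:dateTime"/>
    <xs:element name="TargetImageURI" type="xs:anyURI"/>
    <xs:element name="Similarity" type="xs:float" minOccurs="0"/>
  </xs:sequence>
</xs:complexType>

<xs:complexType name="FindObjectImageResultList">
  <xs:sequence>
    <xs:element name="Result" type="tt:FindObjectImageResult" minOccurs="0" maxOccurs="unbounded"/>
  </xs:sequence>
</xs:complexType>
```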
Thank you for your suggestion. @sujithhanwha